- .uh "How to Boot Allspice"
- .pp
- I am ``Allspice'', Sprite's root file server. To boot after a power-up try
- .(l
- >b sd()new
- .)l
- to boot from disk. If this hangs or doesn't work for some reason, you may
- have to do a network boot from ginger:
- .(l
- >b ie(0,961c,43)sun4.md/new
- .)l
- If you get a ``phase error'' when booting off the disk, you need to
- reset the bus and try again. To reset the bus, try booting from a
- non-existent disk, e.g.,
- .(l
- >b sd(0,6)new
- .)l
- If you don't want Allspice to be the root server, use the -backup flag:
- .(l
- >b ie(0,961c,43)sun4.md/new -backup
- .)l
- To reboot when running Sprite, use the shutdown command.
- .(l
- % sync
- % shutdown -R 'ie(0,961c,43)sun4.md/new'
- .)l
- The ``sync'' command writes out the cache; it isn't required unless
- you are paranoid. Shutdown will sync the disks as the last thing before
- rebooting.
- .pp
- If Allspice is too wedged to get things done with user commands,
- then sync the disks with:
- .(l
- break-W
- .)l
- This should print a message about queuing a call to sync the disks,
- and when it is done it should print a ``.'' and a newline.
- If you don't get the newline then Allspice is deadlocked inside the
- file system cache, sigh.
- .pp
- (Note: on a regular Sun keyboard, this would be L1-W, and you'd use
- the L1 key like a shift key. On a regular ASCII terminal, like
- Allspice's console, you use the break key like escape: break, then W.)
- .pp
- You can abort Allspice with:
- .(l
- break-A
- .)l
- And then use the boot command described above.
-
- .uh "Debugging Tips"
- .pp
- The current procedure for an Allspice crash is to take a core dump
- and then reboot.
- .pp
- If Allspice acts up then you might try the following things.
- If you aren't logged in, log in as root.
- Useful commands are:
- .(l
- allspice # rpcstat -srvr
- .)l
- This dumps out the status of all the RPC server processes. If a bunch
- are ``busy'', and they remain busy with the same RPC ID and client, then
- there may be a deadlock.
- If they are all in the ``wait'' state it means that the Rpc_Daemon process
- is not doing rebinding for some reason.
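The "busy with the same RPC ID and client" test above can be sketched in Python. This is only an illustration: the notes do not show the rpcstat output format, so the ServerState fields below are hypothetical and you would fill them in by hand (or with your own parser) from two successive listings.

```python
from typing import NamedTuple


class ServerState(NamedTuple):
    """One RPC server process, transcribed from an rpcstat -srvr listing.

    The field layout is an assumption; rpcstat's real output is not
    described in these notes.
    """
    server: int   # index of the server process (hypothetical field)
    state: str    # "busy" or "wait"
    rpc_id: int   # ID of the RPC being serviced
    client: str   # client host being served


def stuck_servers(before, after):
    """Return servers busy with the same RPC ID and client in both samples."""
    earlier = {s.server: s for s in before}
    stuck = []
    for s in after:
        old = earlier.get(s.server)
        if (old is not None
                and s.state == "busy" and old.state == "busy"
                and s.rpc_id == old.rpc_id and s.client == old.client):
            stuck.append(s.server)
    return stuck


def all_waiting(sample):
    """True if there is at least one server and every one is in "wait"."""
    return bool(sample) and all(s.state == "wait" for s in sample)
```

A non-empty result from stuck_servers on two samples taken some seconds apart suggests a possible deadlock; all_waiting corresponds to the case where the Rpc_Daemon is not rebinding.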
- .(l
- allspice # ps -a
- .)l
- This will tell you if any important daemons have died. In particular,
- verify that arpd, ipServer, portmap, unfsd, inetd, tftpd, bootp, lpd,
- and sendmail are still around.
- If the ipServer is in the DEBUG state you can kill it and
- the daemons that depend on it with /hosts/allspice/restartIPServer.
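As a rough aid for the daemon check above, the following Python sketch scans saved "ps -a" output for the daemons named in these notes. It assumes only that each command name appears as a whitespace-separated word somewhere on its line; the actual column layout of Sprite's ps is not assumed.

```python
# Daemon names are taken from the list in these notes.
REQUIRED_DAEMONS = ("arpd", "ipServer", "portmap", "unfsd", "inetd",
                    "tftpd", "bootp", "lpd", "sendmail")


def missing_daemons(ps_output, required=REQUIRED_DAEMONS):
    """Return the required daemons whose names appear nowhere in ps_output."""
    seen = set()
    for line in ps_output.splitlines():
        for word in line.split():
            # Drop any leading directory so a full path still matches
            # the bare daemon name.
            seen.add(word.rsplit("/", 1)[-1])
    return [name for name in required if name not in seen]
```

An empty list means every daemon was seen at least once. It says nothing about a daemon's state, so an ipServer in the DEBUG state still has to be spotted by eye.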
- .(l
- % rpcecho -h \fIhostname\fP -n 1000
- .)l
- This program, which is found in /sprite/src/benchmarks/rpcecho,
- and may or may not be installed in /sprite/cmds,
- will tell you if there are timeouts when using the RPC protocol to
- talk to another host. If you suspect that a host with an Intel
- ethernet interface is flaking out, you can try this command.
- Lots of timeouts indicate trouble.
- You can reset a host's network interface from its console with
- .(l
- break-N
- .)l
- .pp
- If RPCs to Allspice are hanging but there's no obvious sign of
- trouble, the problem might be that the timer queue is wedged. To
- verify that this is the problem, type
- .(l
- break-T
- .)l
- This will give the current time (as a number) and the time that
- elements of the timer queue are supposed to be processed, sorted by
- increasing time. You may need to use ctrl-S to freeze the display
- (use ctrl-Q to unfreeze it). If the current time is greater than the
- earliest element in the timer queue, the timer is wedged and needs
- prodding. To prod the timer, type
- .(l
- break-A
- .)l
- to go to the PROM monitor, and then type ``c'' to continue back to
- Sprite. If this fails to unwedge the timer, you should reboot.
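The wedged-timer test above is a single comparison. A minimal Python sketch, assuming you have copied the current time and the queue times from the break-T display as plain numbers:

```python
def timer_is_wedged(current_time, queue_times):
    """True if the earliest timer-queue element is already overdue.

    current_time and queue_times are the numbers printed by break-T,
    transcribed by hand; the display format itself is not parsed here.
    """
    if not queue_times:
        return False  # nothing queued, so nothing can be overdue
    return current_time > min(queue_times)
```

For example, timer_is_wedged(1000, [900, 1100]) is true because the element due at 900 has not been processed.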
- .uh "Kernel Debugging"
- .pp
- If Allspice is so hung you can't explore with user commands,
- then the best you can do is sync the disks with:
- .(l
- break-W
- .)l
- Then throw Allspice into the debugger with:
- .(l
- break-D
- .)l
- If this drops you into the monitor (the '>' prompt), you can
- still get into the debugger by typing 'c' to the monitor.
- You may have to do this twice. You should eventually get
- a message about ``Entering the debugger...''.
- .pp
- You have to run the debugger from shallot or another sun4 Unix machine, unless
- there is a stand-alone Sprite machine available. To log in to shallot, you
- can use ginger's console, which is next to you, on top of ginger.
- You should verify that Allspice is accessible by running
- .(l
- ginger% kmsg -v allspice
- .)l
- This should return the kernel version that Allspice is running.
- If this times out then either Allspice isn't in the debugger
- or, more likely, no one is responding to ARP requests for Allspice's
- IP address. Run the setup-arp script that is in ~sprite bin:
- .(l
- ginger% setup-arp
- .)l
- Now rlogin to shallot and run the Sprite kernel debugger.
- The kernel images should be copied to ginger:/home/ginger/sprite/kernels
- (visible as /home/ginger/sprite/kernels on shallot),
- and their version number should be
- evident in their name, e.g. sun4.1.065. If not, you can run
- strings on the kernel images and grep for ``VERSION''.
- .(l
- shallot% strings /tmp/sprite/sun4.sprite | egrep VERSION
- .)l
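If strings is not at hand, roughly the same search can be done in Python. This sketch is an approximation of "strings | egrep VERSION", not a replacement for it: it treats only printable ASCII as string characters and uses a minimum run length of 4, which is the usual strings default.

```python
import re


def version_strings(data, min_len=4):
    """Return printable-ASCII runs in data (bytes) that contain "VERSION"."""
    # Runs of at least min_len printable ASCII characters (space to "~").
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii")
            for m in pattern.finditer(data)
            if b"VERSION" in m.group()]
```

To use it on a kernel image, read the file in binary mode and pass the bytes in, for example version_strings(open(path, "rb").read()), where path names the image you copied.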
- To run the kernel debugger (kgdb.sun4 is in ~sprite/cmds.sun4):
- .(l
- shallot% cd /home/ginger/sprite/kernels
- shallot% Gdb sun4.\fIversion\fP
- .)l
- If the RPC system seems to be the problem, you can dump the
- trace of recent RPCs by calling Rpc_PrintTrace(numRecs)
- .(l
- (kgdb) print Rpc_PrintTrace(50)
- .)l
- If there is a deadlock you can dump the process table:
- .(l
- (kgdb) print Proc_Dump()
- .)l
- or, if you want to look at only waiting processes:
- .(l
- (kgdb) print Proc_KDump()
- .)l
- You can switch from process to process and get stack backtraces
- by using the 'pid' command. You only need to specify the last
- two hex digits of the process ID. If you only have a decimal ID,
- then you have to type the whole thing.
- File system deadlocks usually center on locked handles.
- When you find a process stuck in Fsutil_HandleFetch or Fsutil_HandleLock
- you can try to find the culprit by looking at the *hdrPtr these
- guys are waiting on. There is a 'lockProcessID' in the hdrPtr that
- is really the address of a Proc_ControlBlock. You can print this out
- with something like:
- .(l
- (kgdb) print *(Proc_ControlBlock *)(hdrPtr->lockProcessID)
- .)l
- You can reboot Allspice from within kgdb with the reboot command.
- .(l
- (kgdb) reboot ie(0,9634)sun4.md/new
- .)l
- .uh "Taking a core dump"
- .pp
- Step 1) Make sure Allspice is in the debugger. If not, put it in the debugger.
- .pp
- Step 2) Log in to ginger. Go to a file system with > 40 megabytes free
- space, e.g., /home/ginger/cores (now /export1/cores).
- .pp
- Step 2.5) You might need to run ~sprite/cmds.sun3/setup-arp for ginger to
- be able to talk with allspice.
- .pp
- Step 3) Run kgcore as follows: "~sprite/cmds.sun3/kgcore -v allspice"
- The -v is optional but I like it because it prints progress messages.
- Note that this step can take several (~ 5) minutes.
- .pp
- Step 4) Rename the file "vmcore" to something more meaningful such as
- "mv vmcore vmcore.allspice.crash.11-21".
- .pp
- Step 5) Reboot allspice. (~sprite/cmds.sun3/kmsg -R "sd()new" allspice)
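Two of the steps above (checking free space in Step 2 and choosing a name in Step 4) can be sketched in Python on the Unix side. The helper names are made up for illustration. The 40-megabyte threshold is taken as 40 * 1024 * 1024 bytes, and the month-day suffix is inferred from the single example "vmcore.allspice.crash.11-21"; both are assumptions.

```python
import shutil
from datetime import date

# "> 40 megabytes" from Step 2, interpreted here as binary megabytes.
MIN_FREE_BYTES = 40 * 1024 * 1024


def has_room_for_core(directory, min_free=MIN_FREE_BYTES):
    """True if the file system holding directory has more than min_free bytes free."""
    return shutil.disk_usage(directory).free > min_free


def core_name(host, day):
    """Build a name like vmcore.allspice.crash.11-21 for the given date."""
    return "vmcore.%s.crash.%d-%02d" % (host, day.month, day.day)
```

A typical use would be to call has_room_for_core on the cores directory before running kgcore, and afterwards rename vmcore to core_name("allspice", date.today()).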
- .uh "Modify date"
- .pp
- These notes were last updated by Mike Kupfer on \*(td.
-